MLP (Multi-Layer Perceptron)

(Figure: MLP network structure)

How It Works

An MLP (Multi-Layer Perceptron) is **essentially a fully connected deep neural network (DNN)**. A plain perceptron has only a single layer of functional neurons doing the learning, so its capacity is very limited and it cannot solve problems that are not linearly separable (XOR being the classic example). To overcome this, we stack multiple layers of neurons and let them learn together.

Core Idea

Backpropagation (BP)

The way a neural network learns is remarkably similar to how we learn. From primary school through university, after studying some material we take exams and then use the scores to correct our earlier mistakes. BP follows the same idea: at the start the network knows nothing, so we give it random weights; it then produces an output for each training example, which we compare against the ground truth (much like comparing an exam answer against the standard answer), and the difference tells us where the problem lies.

The BP algorithm is based on gradient descent: the parameters are adjusted along the negative gradient of the objective.
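Concretely, every parameter is nudged against its own gradient, scaled by a learning rate $\eta$ (the `learn_rate` variable in the code below):
$$w \leftarrow w - \eta \frac{\partial J}{\partial w}, \qquad b \leftarrow b - \eta \frac{\partial J}{\partial b}$$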
A simple example will make the mechanics of BP clear.
Suppose we have N samples, each with F features, so the input matrix has size [F, N]. There is a single hidden layer with F1 neurons, so w1 has size [F1, F] and b1 has size [F1, 1]; after the affine step, z1 has size [F1, N]. The hidden activation is ReLU, so the first layer's output a1 has size [F1, N]. For a binary classification problem the output layer has a single neuron, so w2 has size [1, F1] and b2 has size [1, 1]; z2 then has size [1, N], the output activation is Sigmoid, and the second layer's output $\hat{y}$ has size [1, N].

We write the forward pass with the following notation:
$$z1 = w1 \cdot x + b1$$
$$a1 = \mathrm{ReLU}(z1)$$
$$z2 = w2 \cdot a1 + b2$$
$$\hat{y} = \mathrm{Sigmoid}(z2)$$
$$J(\hat{y}, y) = -[\,y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})\,]$$
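To make the shapes concrete, here is a minimal MATLAB sketch of this forward pass and loss. It assumes `x` is the [F, N] input, `y` the [1, N] label vector, and `w1`, `b1`, `w2`, `b2` are already initialized with the sizes listed above; the `relu`/`sigmoid` helpers are defined inline, and the bias addition relies on implicit expansion (MATLAB R2016b and later). Note that the derivation uses plain ReLU, while the full script below switches to leaky ReLU.

```matlab
% Forward pass for one hidden layer (shapes as in the text):
% x: [F, N], w1: [F1, F], b1: [F1, 1], w2: [1, F1], b2: [1, 1]
relu    = @(z) max(z, 0);
sigmoid = @(z) 1 ./ (1 + exp(-z));

z1    = w1 * x  + b1;        % [F1, N] (bias added via implicit expansion)
a1    = relu(z1);            % [F1, N]
z2    = w2 * a1 + b2;        % [1, N]
y_hat = sigmoid(z2);         % [1, N]

% Binary cross-entropy, averaged over the N samples
J = -mean(y .* log(y_hat) + (1 - y) .* log(1 - y_hat));
```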

Now we derive the error backpropagation step; all we need are the gradients of the loss with respect to each parameter:

$$\frac{\partial J}{\partial \hat{y}} = -\left(\frac{y}{\hat{y}} - \frac{1 - y}{1 - \hat{y}}\right)$$
$$\frac{\partial J}{\partial z2} = \frac{\partial J}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z2} = -\left(\frac{y}{\hat{y}} \hat{y}(1 - \hat{y}) - \frac{1 - y}{1 - \hat{y}} \hat{y}(1 - \hat{y})\right) = \hat{y} - y$$
$$\frac{\partial J}{\partial w2} = \frac{\partial J}{\partial z2} \cdot \frac{\partial z2}{\partial w2} = \frac{\partial J}{\partial z2} \cdot a1^T$$
$$\frac{\partial J}{\partial b2} = \frac{\partial J}{\partial z2} \cdot \frac{\partial z2}{\partial b2} = \frac{1}{N}\sum_{1}^{N}\frac{\partial J}{\partial z2}$$
$$\frac{\partial J}{\partial a1} = \frac{\partial J}{\partial z2} \cdot \frac{\partial z2}{\partial a1} = w2^T \cdot \frac{\partial J}{\partial z2}$$
$$\frac{\partial J}{\partial z1} = \frac{\partial J}{\partial a1} \cdot \frac{\partial a1}{\partial z1} = \frac{\partial J}{\partial a1} \odot \mathrm{ReLU}'(z1)$$
$$\frac{\partial J}{\partial w1} = \frac{\partial J}{\partial z1} \cdot \frac{\partial z1}{\partial w1} = \frac{\partial J}{\partial z1} \cdot x^T$$
$$\frac{\partial J}{\partial b1} = \frac{\partial J}{\partial z1} \cdot \frac{\partial z1}{\partial b1} = \frac{1}{N}\sum_{1}^{N}\frac{\partial J}{\partial z1}$$
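Mapping these formulas to code is mostly a matter of matching shapes. Continuing the forward-pass sketch above (so `x`, `y`, `z1`, `a1`, `y_hat`, `w1`, `w2` are as before, `N` is the sample count, and `eta` stands for the learning rate), and following the convention of the derivation, where only the bias gradients are averaged over the N samples:

```matlab
% Backward pass for the two-layer network above
dz2 = y_hat - y;                 % [1, N]
dw2 = dz2 * a1';                 % [1, F1]
db2 = sum(dz2, 2) / N;           % [1, 1]

da1 = w2' * dz2;                 % [F1, N]
dz1 = da1 .* (z1 > 0);           % [F1, N]  ReLU'(z1): 1 where z1 > 0, else 0
dw1 = dz1 * x';                  % [F1, F]
db1 = sum(dz1, 2) / N;           % [F1, 1]

% Gradient-descent update with learning rate eta
w2 = w2 - eta * dw2;   b2 = b2 - eta * db2;
w1 = w1 - eta * dw1;   b1 = b1 - eta * db1;
```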

Code in Practice

mlp_train.m

clear; clc;
% Activation functions
f_leaky_relu = @(x) max(0.01*x, x);
f_sigmod = @(x) 1./(1 + exp(-x));
% Derivative of leaky ReLU: 1 where x > 0, 0.01 elsewhere
df_leaky_relu = @(x) (x > 0) + 0.01*(x <= 0);
% Learning rate
learn_rate = 0.01;
% Input matrix x (one sample per column)
train_x = [0.6,0.1,0.1,0.4,0.8,0.5,0.9,0.1,0.5,0.9,0.5,0.4,0.5,0.6;...
           0.4,0.1,0.8,0.6,0.1,0.6,0.9,0.5,0.1,0.5,0.9,0.4,0.4,0.6];
% x=[0.8,0.1,0.4;...
%    0.6,0.1,0.8];
% Label vector y (row vector)
train_y = [1,0,0,1,0,1,0,0,0,0,0,1,1,1];
% y=[1,0,1];
% Number of training samples
train_num = length(train_y);
% Number of features
feat_num = size(train_x, 1);
% Number of nodes in each layer
node = [200, 1];
% Number of layers
network_num = length(node);
% w(:,:,i) holds the weights of layer i; the active block is node(i) x node(i-1)
% (node(1) x feat_num for the first layer)
max_col = max([feat_num, node(1:end-1)]);
max_row = max(node);
w = zeros(max_row, max_col, network_num);
for i = 1:network_num
    if i == 1
        w(1:node(i), 1:feat_num, i) = rand(node(i), feat_num)*0.1;
    else
        w(1:node(i), 1:node(i-1), i) = rand(node(i), node(i-1))*0.1;
    end
end
dw = zeros(max_row, max_col, network_num);
% b(:,:,i) holds the biases of layer i; the active block is node(i) x 1
b = zeros(max_row, 1, network_num);
db = zeros(max_row, 1, network_num);
% z(:,:,i) holds the pre-activations of layer i; the active block is node(i) x train_num
z = zeros(max_row, train_num, network_num);
dz = zeros(max_row, train_num, network_num);
% a(:,:,i) holds the activations of layer i; the active block is node(i) x train_num
a = zeros(max_row, train_num, network_num);
fprintf('Training started...\n\n');
for times = 1:5000
    %% Forward propagation
    for i = 1:network_num
        b_broadcast = zeros(node(i), train_num);
        for j = 1:train_num
            b_broadcast(:, j) = b(1:node(i), 1, i);
        end
        if i == 1
            % z[1] = w[1]x + b[1]
            z(1:node(i), 1:train_num, i) = w(1:node(i), 1:feat_num, i)*train_x + b_broadcast;
        else
            % z[n] = w[n]a[n-1] + b[n]
            z(1:node(i), 1:train_num, i) = w(1:node(i), 1:node(i-1), i)*a(1:node(i-1), 1:train_num, i-1) + b_broadcast;
        end
        if i == network_num
            % a[network_num] = sigmoid(z[network_num])
            a(1:node(i), 1:train_num, i) = f_sigmod(z(1:node(i), 1:train_num, i));
        else
            % a[n] = leaky_relu(z[n])
            a(1:node(i), 1:train_num, i) = f_leaky_relu(z(1:node(i), 1:train_num, i));
        end
    end

    %% Loss
    e = 0;
    for k = 1:train_num
        % Cross-entropy loss e = -1/m * sum(y*log(y_hat) + (1-y)*log(1-y_hat))
        e = e - (train_y(k)*log(a(1, k, network_num)) + (1 - train_y(k))*log(1 - a(1, k, network_num)));
    end
    e = e/train_num;

    %% Back propagation
    for i = network_num:-1:1
        if i == network_num
            % dz[network_num] = y_hat - y
            dz(1:node(i), 1:train_num, i) = a(1, :, network_num) - train_y;
        else
            % dz[n] = w[n+1]'*dz[n+1] .* df_leaky_relu(z[n])
            dz(1:node(i), 1:train_num, i) = w(1:node(i+1), 1:node(i), i+1)'*dz(1:node(i+1), 1:train_num, i+1).*df_leaky_relu(z(1:node(i), 1:train_num, i));
        end
        if i == 1
            % dw[1] = dz[1]*x'
            dw(1:node(i), 1:feat_num, i) = dz(1:node(i), 1:train_num, i)*train_x';
        else
            % dw[n] = dz[n]*a[n-1]'
            dw(1:node(i), 1:node(i-1), i) = dz(1:node(i), 1:train_num, i)*a(1:node(i-1), 1:train_num, i-1)';
        end
        % db[n] = mean of dz[n] over the samples
        db(1:node(i), 1, i) = sum(dz(1:node(i), 1:train_num, i), 2)/train_num;
    end

    %% Parameter update (gradient descent)
    for i = 1:network_num
        if i == 1
            w(1:node(i), 1:feat_num, i) = w(1:node(i), 1:feat_num, i) - learn_rate*dw(1:node(i), 1:feat_num, i);
        else
            w(1:node(i), 1:node(i-1), i) = w(1:node(i), 1:node(i-1), i) - learn_rate*dw(1:node(i), 1:node(i-1), i);
        end
        b(1:node(i), 1, i) = b(1:node(i), 1, i) - learn_rate*db(1:node(i), 1, i);
    end
end

%% Check whether training succeeded
if e > 0.01
    disp('Training did not converge; please adjust the parameters and retrain.');
else
    disp('Training finished; run mlp_test to start testing.');
end

mlp_test.m

clc; close all;
% Note: run mlp_train.m first; this script reuses w, b, node, network_num,
% max_row, feat_num, f_sigmod, f_leaky_relu, train_x and train_y from the workspace.
% Test set
test_x = rand(2, 500);
% Number of test samples
test_num = size(test_x, 2);
% test_z(:,:,i) holds the pre-activations of layer i; the active block is node(i) x test_num
test_z = zeros(max_row, test_num, network_num);
% test_a(:,:,i) holds the activations of layer i; the active block is node(i) x test_num
test_a = zeros(max_row, test_num, network_num);

for i = 1:network_num
    b_broadcast = zeros(node(i), test_num);
    for j = 1:test_num
        b_broadcast(:, j) = b(1:node(i), 1, i);
    end
    if i == 1
        % z[1] = w[1]x + b[1]
        test_z(1:node(i), 1:test_num, i) = w(1:node(i), 1:feat_num, i)*test_x + b_broadcast;
    else
        % z[n] = w[n]a[n-1] + b[n]
        test_z(1:node(i), 1:test_num, i) = w(1:node(i), 1:node(i-1), i)*test_a(1:node(i-1), 1:test_num, i-1) + b_broadcast;
    end
    if i == network_num
        % a[network_num] = sigmoid(z[network_num])
        test_a(1:node(i), 1:test_num, i) = f_sigmod(test_z(1:node(i), 1:test_num, i));
    else
        % a[n] = leaky_relu(z[n])
        test_a(1:node(i), 1:test_num, i) = f_leaky_relu(test_z(1:node(i), 1:test_num, i));
    end
end
% test_y is the network output
test_y = test_a(1, :, network_num);
% Below 0.5 -> class 1, above 0.5 -> class 2, exactly 0.5 -> no decision
for i = 1:test_num
    if test_y(i) > 0.5
        fprintf('Sample %d belongs to class 2 with probability %f\n', i, test_y(i));
    elseif test_y(i) < 0.5
        fprintf('Sample %d belongs to class 1 with probability %f\n', i, 1 - test_y(i));
    else
        fprintf('Sample %d: no decision\n', i);
    end
end
% If the features are two-dimensional, the result can be plotted
if feat_num == 2
    hold on;
    % Training samples are drawn as *
    for i = 1:train_num
        if train_y(i) == 1
            plot(train_x(1, i), train_x(2, i), 'r*')
        else
            plot(train_x(1, i), train_x(2, i), 'b*')
        end
    end
    % Test samples are drawn as o
    for i = 1:test_num
        if test_y(i) > 0.5
            plot(test_x(1, i), test_x(2, i), 'ro')
        elseif test_y(i) < 0.5
            plot(test_x(1, i), test_x(2, i), 'bo')
        else
            plot(test_x(1, i), test_x(2, i), 'go')
        end
    end
    hold off;
else
    disp('The features are not two-dimensional; nothing to plot.')
end

Results

(Figure: MLP classification results)

MLP (Multi-Layer Perceptron): Strengths and Weaknesses

  • Strengths:
    • It can handle multi-class classification tasks.
    • Given enough parameters and training iterations, it can fit data that are not linearly separable.
  • Weaknesses:
    • It is computationally expensive, so training takes a long time.
    • It may get stuck in a local optimum, in which case training fails.